ABSTRACT

Motifs are the most repetitive (most frequent) patterns of a time
series. Discovering motifs is crucial for practitioners who need to
understand and interpret the phenomena occurring in sequential data.
Currently, motifs are searched among series sub-sequences, with the
aim of selecting the most frequently occurring ones. Search-based
methods, which try out series sub-sequences as motif candidates, are
currently believed to be the best methods for finding the most
frequent patterns.

However, this paper proposes an entirely new perspective on finding
motifs. We demonstrate that searching is non-optimal because the
domain of motifs is restricted, and instead we propose a principled
optimization approach able to find optimal motifs. We treat the
occurrence frequency as a function and time-series motifs as its
parameters, and we learn the optimal motifs that maximize the
frequency function. In contrast to searching, our method is able to
discover the most repetitive patterns (hence optimal), even in cases
where they do not explicitly occur as sub-sequences. Experiments on
several real-life time-series datasets show that the motifs found by
our method are considerably more frequent than the ones found through
searching, for exactly the same distance threshold.

1. INTRODUCTION

Time series are arguably the most widespread type of data, occurring
in virtually all application domains of our modern lives, wherever
measurements have associated time stamps (e.g. physiological and
medical, financial, meteorological, sound and video, monitoring
system sensors, astronomy light intensities, and many more).

In many cases, the underlying patterns of those datasets are not
known to the domain practitioners, and a visual inspection is often
infeasible given the complexity and size of the data. For this
reason, finding the most repetitive patterns in time series helps
domain experts understand the underlying phenomena within diverse
sources of data [2, 26]. The most repetitive time-series patterns are
called motifs, and their discovery has recently attracted
considerable research [25, 20, 30, 14]. In brief, optimal motifs are
those which repeat the most (i.e. have the highest frequency) given a
distance/similarity threshold value. The approach of the current
state-of-the-art motif discovery methods is to search for motifs
among the segments (a.k.a. sub-sequences) of a time series
[25, 28, 14, 3]. More concretely, series segments are considered to
be motif candidates and the most frequent segments are selected.

In this paper we present an entirely new perspective that is
orthogonal to the search-based approach. First of all, we treat
frequency as a function and motifs as its variables. Naturally, our
task becomes finding the values of the motifs which maximize the
value of the frequency function. From this perspective we formalize
motif discovery as a principled optimization problem and devise an
optimization technique to learn the optimal motifs. The learning
process uses the first-order derivative of the frequency function in
order to find its maximum. In that way, our method can learn motifs
which yield the maximum frequency (a.k.a. the highest number of
matches). The proposed learning method is theoretically superior to
the search-based approach, because searching limits the motif
candidates to the domain of sub-sequences and therefore cannot
discover latent series patterns (Section 4.1).

As the empirical results (Section 6) over various real-life datasets
will indicate, our optimal motifs have significantly more matches
(higher frequency) than the ones found through searching, for exactly
the same distance threshold.


2. RELATED WORK 

The research on discovering time-series motifs has suffered from a
terminological ambiguity. Initially, motifs were defined to be the
most frequently occurring patterns in a time series [25]. However,
another stream of papers redefined the term "motif" as the closest
pair among series segments [23, 21]. In this paper we mean "the most
frequently occurring patterns" [25] when referring to motifs. The
closest pair of series segments, on the other hand, will be referred
to as a "pair-motif", following the suggestion of [19].


2.1 Pair-motif discovery 

The closest pair of series segments can be perceived as a
sub-variation of the general motif discovery task. The brute-force
search that computes the distance of every segment pair is
computationally expensive, therefore efforts have been devoted to
scaling up the brute-force approach. A fast, yet exact, method that
discovers pair-wise motifs has been introduced by [23]. The
enumeration of all motifs of variable lengths has also been
researched [20, 19]. In a streaming scenario an algorithm cannot rely
on accessing the full past series, therefore the top-k motifs need to
be found via an on-line method, as in [12]. In addition, the
statistical significance of the motifs found has also been a topic of
interest [4].

Note: Finding motif-pairs is equivalent to the problem of locating
the closest pair of points in a geometrical space, which is a
historic problem in computational geometry [8].

2.2 Motif Discovery 

Repeating patterns in sequential data were initially studied in
bio-informatics [2]. Finding motifs is also beneficial in
understanding physiological human data [26], as well as behavioral
patterns of living organisms [1]. The concept of recurrent patterns
was transferred to the realm of time-series data under the term
"motifs" [25], and a search-based approach to discovering motifs was
proposed. In order to find motifs that are immune to noisy
variations, a probabilistic search of time-series motifs was based on
random projections [7]. Another work has explored the use of uniform
scaling as the similarity distance for discovering motifs [28].
Furthermore, a hybrid combination of supervised and unsupervised
learning has been used for searching recurring patterns [24]. The
first step involves a teacher which labels whether or not a time
series includes a particular pattern, while the next step exploits
unsupervised learning from the series in order to reconstruct the
teacher. The task of finding the most recurring motifs has also been
tackled by searching for candidate motifs organized in a tree
structure [15].

The brute-force approach, which tries out every segment
(sub-sequence) as a potential motif, has a quadratic complexity in
the number of segments. Therefore, approximate motif discovery
methods have been explored. Converting the series into a symbolic
representation (named SAX) is a pre-processing alternative [10]. Over
the new representation, agglomerative clustering can be used to find
motifs [10]. A scalable alternative that can approximately discover
multi-resolution motifs in a single scan utilizes different
cardinalities of the symbolic representation [3]. Last but not least,
a scalable version of pair-wise motif discovery has been extended to
general motif discovery for large-scale data [22].


Given the prevalence of multi-dimensional time series, there has also
been interest in mining multi-dimensional motifs. Several strategies
were inspected, where motifs span all versus a subset of the
dimensions, with or without temporal overlap [17]. The algorithm is
based on random projections of the symbolic sub-sequence
representations [17]. Discovering regions of high density in the
space of sub-sequences is another alternative for mining multivariate
motifs [18]. Graph clustering implemented as a two-stage algorithm
was also employed in detecting multi-dimensional motifs [27]. In the
first step single-dimensional motifs are discovered, and they are
later blended through clustering [27].


Since motifs are previously unknown patterns, there is little
information on the motifs' lengths either. For this reason, authors
have attempted to discover the optimal motif length, for instance by
inspecting the compressibility of the data [30]. In addition,
variable-length motifs can be extracted using a grammar-inspired
inference process [13]. Further work has addressed visualizing
variable-length motifs [14], finding them in linear time [6], and
using them for classification purposes [29].

In contrast to the related work, our novel contribution lies in
computing an optimal set of motifs given a threshold distance and the
motifs' length. We are the first to propose a principled optimization
method for the task. As a consequence, our approach leads to
significantly improved motif quality (frequency) compared to
brute-force search.

3. PRELIMINARIES 

3.1 Notations 

3.1.1 Time Series and Motifs 

A time series is a long ordered sequence of real-valued measurements.
Such a series is abstracted as a list of $J$ Z-normalized
sliding-window segments of length $L$ and is denoted as
$S \in \mathbb{R}^{J \times L}$. A repetitive pattern, a.k.a. motif,
on the other hand, is simply a sequence of $L$ points. The definition
can be generalized to a set of $K$ motifs, consequently denoted as
$M \in \mathbb{R}^{K \times L}$.

3.1.2 Motif Frequency 

The occurrence frequency of a motif is defined as the non-trivial
(see Section 3.2.2) number of matches between a motif and all the
normalized segments of the time series. The current approach of
counting the matching frequency of the $k$-th motif, denoted
$M_{k,:} \in \mathbb{R}^{L}$, iterates over all the
$j \in \{1, \dots, J\}$ sliding-window segments $S_{j,:}$ and checks
whether the motif of interest matches each segment within a threshold
distance $T \in \mathbb{R}^{+}$.


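$$ \mathcal{F}(M) = \sum_{k=1}^{K} \sum_{j=1}^{J} \mathcal{F}_{k,j} \tag{1} $$

$$ \mathcal{F}_{k,j} = \begin{cases} 1 & \text{if } \sum_{l=1}^{L} (M_{k,l} - S_{j,l})^2 < T \\ 0 & \text{otherwise} \end{cases} \tag{2} $$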


Equation 1 presents the formalism for the overall frequency as a sum
of the motifs' frequencies, while Equation 2 encapsulates the concept
of a match. If the distance between a segment $S_{j,:}$ and a motif
$M_{k,:}$ is less than the threshold $T$, then a matching value of
one is granted.
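The counting rule of Equations 1 and 2 can be illustrated with a small sketch. It is only a minimal illustration under stated assumptions: the function names and the toy data are hypothetical, the distance is the squared Euclidean distance, and trivial matches (Section 3.2.2) are not filtered out here.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hard_frequency(motifs, segments, threshold):
    """Equation 1: number of (motif, segment) pairs that match.

    A pair matches when its squared distance is below the threshold
    (Equation 2). Trivial matches are NOT removed in this sketch.
    """
    return sum(
        1
        for motif in motifs
        for segment in segments
        if squared_distance(motif, segment) < threshold
    )


# Hypothetical toy data: three segments of length 2 and one motif.
segments = [[0.0, 0.0], [0.5, 0.5], [3.0, 3.0]]
motifs = [[0.2, 0.2]]
print(hard_frequency(motifs, segments, threshold=1.0))
```

In this toy case the motif lies within the threshold of the first two segments only, so the count is two.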

3.2 Problem Definition 

3.2.1 Optimal Motifs 

Following the established literature definition, the only optimality
criterion of a motif is its frequency at a particular distance
threshold. Therefore, the only legitimate metric to compare the
qualities of motifs is frequency (a.k.a. support, or number of
matches). The optimal motifs $M^*$ for a time series are defined in
Equation 3 as the candidate motifs $M$ that achieve the maximum
frequency value $\mathcal{F}(M)$ from Equation 1. There is,
nevertheless, an important constraint in the search for motifs: the
$K$ motifs should be different from each other [25], otherwise the
motifs risk being close variations of the single most repetitive
motif. Such a constraint is presented under a "such that (s.t.)"
clause in Equation 3, which enforces each pair of motifs
$(M_{k,:}, M_{p,:})$ to be different from each other by a distance of
at least $2T$ (so that the pair does not overlap within a threshold
$T$, details in [25]).


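$$ M^* := \operatorname*{argmax}_{M \in \mathbb{R}^{K \times L}} \mathcal{F}(M) \quad \text{s.t. } \sum_{l=1}^{L} (M_{k,l} - M_{p,l})^2 > 2T, \quad \forall k \in \{1, \dots, K\},\ \forall p \in \{k+1, \dots, K\} \tag{3} $$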


3.2.2 Trivial Matches 

In short, trivial matches are consecutive segments which match the
same motif [25]. For instance, this case might happen if the sliding
window is incremented by one. In that case two subsequent segments
share exactly $L - 1$ points, and therefore the distances of any
motif to those close-by segments will be very similar. Some related
works increment the sliding window by an offset of several points, so
that trivial matches are bypassed at the risk of potentially missing
certain matches [3, 18]. In our paper, however, none of the reported
frequency figures include trivial matches.


3.3 Searching The Motifs 

The state-of-the-art methods referred to in Section 2 that focus on
searching motifs are primarily concerned with trying candidate motifs
from the series segments. Despite proposing important novelties in
their scope (scalability, length analysis, etc.), these techniques
are still upper bounded in terms of quality by the brute-force motif
search.

Algorithm 1 describes an implementation of brute-force motif search
that is naive in terms of speed, yet optimal in terms of quality
among search-based methods. We can pre-compute the frequencies of all
series segments in $O(J^2 L)$ runtime and then search the top-$K$
motifs using the computed frequencies in $O(K^2 J L)$ time. Since $K$
is typically small compared to the number of segments ($J \gg K$),
the overall brute-force search has a complexity of
$O(J^2 L + K^2 J L) \approx O(J^2 L)$, meaning quadratic in the
number of segments. In this paper we propose a learning (not
searching) method that outputs motifs having higher frequencies than
those discovered by the brute-force approach.
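The brute-force procedure of Algorithm 1 can be sketched in Python as follows. This is a hedged illustration rather than the authors' implementation: the function names and toy data are hypothetical, the selection step keeps the candidate with the highest precomputed frequency, and a candidate counts as diverse when its squared distance to every previously chosen motif exceeds 2T.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_motif_search(segments, threshold, num_motifs):
    """Pick up to num_motifs segments with the highest non-trivial frequency."""
    n = len(segments)

    # Step 1: precompute the frequency of every segment, O(J^2 L).
    freq = [0] * n
    for j in range(n):
        last_match = None
        for r in range(n):
            if squared_distance(segments[j], segments[r]) < threshold:
                # A run of consecutive matching segments counts only once
                # (trivial matches).
                if last_match is None or r - last_match > 1:
                    freq[j] += 1
                last_match = r

    # Step 2: greedily select the most frequent segment that is diverse
    # with respect to all previously selected motifs.
    chosen = []
    for _ in range(num_motifs):
        best = None
        for j in range(n):
            diverse = all(
                squared_distance(segments[j], segments[p]) > 2 * threshold
                for p in chosen
            )
            if diverse and (best is None or freq[j] > freq[best]):
                best = j
        if best is None:  # no diverse candidate is left
            break
        chosen.append(best)

    return [segments[j] for j in chosen], [freq[j] for j in chosen]


# Hypothetical toy data: two clusters and one outlier.
toy = [[0.0, 0.0], [5.0, 5.0], [0.1, 0.1], [5.1, 5.1], [0.2, 0.0], [9.0, 9.0]]
found, counts = brute_force_motif_search(toy, threshold=1.0, num_motifs=2)
print(found, counts)
```

Every returned motif is, by construction, one of the input segments, which is exactly the restriction that the learning approach of Section 4 removes.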


4. PROPOSED METHOD 
4.1 Motivation 

The state-of-the-art methods used for finding motifs are based on
searching for the most frequently occurring candidate segment. In
other words, any motif has to explicitly occur as a series segment,
$M_{k,:} \in S, \forall k \in \{1, \dots, K\}$. Unfortunately, such
constrained motifs are very restricted in the finite space of
possible values they can take, compared to the space of real matrices
$M \in \mathbb{R}^{K \times L}$ (infinitely more candidates than
$M \in S$). In this paper, we hypothesize and empirically show that
the optimal motifs are located in the space of real numbers
$M \in \mathbb{R}^{K \times L}$, while the space of segments contains
sub-optimal motifs. Figure 1 provides a hint for the comparison
between restricted motifs ($M \in S$) and unrestricted optimal ones.
From a geometrical perspective the segments and the motifs are points
in an $L$-dimensional space. In the example of Figure 1 the segments
and motifs have a length of 2, thus the scenario is 2-dimensional.

Algorithm 1 BruteForceMotifSearch()

Input: Threshold $T \in \mathbb{R}^{+}$, Motif length $L \in \mathbb{N}^{+}$,
       Number of Motifs $K \in \mathbb{N}^{+}$, Segments $S \in \mathbb{R}^{J \times L}$
Output: $M \in \mathbb{R}^{K \times L}$
// Precompute frequencies of all segments
for $j = 1, \dots, J$ do
    $\mathcal{F}_j \leftarrow 0$
    lastMatchIndex $\leftarrow -\infty$
    for $r = 1, \dots, J$ do
        if $\|S_{j,:} - S_{r,:}\|^2 < T$ then
            // Avoid trivial matches
            if $r -$ lastMatchIndex $> 1$ then
                $\mathcal{F}_j \leftarrow \mathcal{F}_j + 1$
            end if
            lastMatchIndex $\leftarrow r$
        end if
    end for
end for
// Select top-K motifs
for $k = 1, \dots, K$ do
    $best_j \leftarrow 0$    // with the convention $\mathcal{F}_0 = 0$
    for $j = 1, \dots, J$ do
        // Check if the j-th segment is diverse
        if $\|S_{j,:} - M_{p,:}\|^2 > 2T, \ \forall p = k-1, \dots, 1$ then
            if $\mathcal{F}_j > \mathcal{F}_{best_j}$ then
                $best_j \leftarrow j$
            end if
        end if
    end for
    $M_{k,:} \leftarrow S_{best_j,:}$
end for
return $M$


Figure 1: Motif found by searching (left panel, "Search Motif (T=1)")
yields 3 matches, while learning a latent motif (right panel, "Learn
Motif (T=1)") yields 4 matches

The frequency of a motif $M$, given a threshold $T$, can be
interpreted as the number of segment points (blue in the
illustration) that lie within a radius of the threshold distance from
the motif (shown in red). The radius is $\sqrt{T}$ because we use the
squared Euclidean distance in Equation 2; however, this poses no
problem since $T$ is a hyper-parameter of our method anyway. The most
frequently occurring motif is defined to be the point that covers the
maximum number of blue points (segments) inside the circle of radius
$\sqrt{T}$ centered at the motif, hence the densest geometrical ball
[15]. The best segment-motif is shown in the left plot of Figure 1
and has a frequency of three. However, the optimal motif is located
in the right plot and has a frequency of four. As clearly seen, the
optimal solution is hidden in the space of real numbers, outside the
very restricted set of segment points. The method proposed in this
paper learns optimal motifs lying in the real-number space through a
tailored numerical optimization technique. Even though the
aforementioned 2-dimensional example was created to alert the reader
to the need for learning motifs, the empirical results of Section 6.2
will demonstrate that, on real-life time series, learning motifs
yields more frequently occurring patterns than searching for them.

4.2 Smooth (Differentiable) Motif Frequency 

We are going to find the optimal motif through a mathematical
maximization of the frequency as a function of the motifs.
Unfortunately, the frequency of Equation 1 has two problems: (i) it
is not continuous at the point $\|M_{k,:} - S_{j,:}\|^2 = T$, and
(ii) its first derivative is zero at all other points (i.e. the
frequency is flat, taking values 1 or 0). Therefore we cannot compute
the optimal motifs from this function using gradient-based
optimization. However, we can use a differentiable approximation of
the frequency function based on the Gaussian kernel of Equations 4-5.


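$$ \hat{\mathcal{F}}(M) = \frac{1}{KJ} \sum_{k=1}^{K} \sum_{j=1}^{J} \hat{\mathcal{F}}_{k,j} \tag{4} $$

$$ \hat{\mathcal{F}}_{k,j} = e^{-\frac{\alpha}{T} \sum_{l=1}^{L} (M_{k,l} - S_{j,l})^2} \tag{5} $$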


The smooth frequency function of Equation 5 is both an accurate
approximation of the frequency measure from Equation 2 and a
differentiable alternative, as illustrated in Figure 2 (left plot).
The parameter $\alpha$ controls the smoothness of the soft frequency.
For optimization reasons (details in Section 4.4) the frequency sum
of Equation 4 is divided by $KJ$ to limit the value of
$\hat{\mathcal{F}}$ between 0 and 1. In terms of notation, the
approximated frequency is distinguished by a hat ($\hat{\mathcal{F}}$
vs. $\mathcal{F}$).
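A minimal sketch of the smooth frequency of Equations 4 and 5 follows. The function names and sample values are hypothetical and chosen only for illustration; the code assumes the squared Euclidean distance and the normalization by KJ described above.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smooth_frequency(motifs, segments, threshold, alpha):
    """Equations 4-5: Gaussian-kernel surrogate of the match count.

    Each (motif, segment) pair contributes exp(-alpha / T * distance),
    and the sum is divided by K * J so the result lies in (0, 1].
    """
    total = 0.0
    for motif in motifs:
        for segment in segments:
            dist = squared_distance(motif, segment)
            total += math.exp(-alpha / threshold * dist)
    return total / (len(motifs) * len(segments))


# A motif identical to the only segment scores exactly 1.
print(smooth_frequency([[0.0, 0.0]], [[0.0, 0.0]], threshold=1.0, alpha=2.0))
# The score decays smoothly as the motif moves away from the segment.
print(smooth_frequency([[0.0, 0.0]], [[1.0, 0.0]], threshold=1.0, alpha=2.0))
```

Unlike the hard count, this score changes whenever a motif point moves, which is what makes a gradient available.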


Figure 2: Smooth vs. hard variants of frequency (left panel, "Hard
vs. Smooth Frequency (T=1)") and diversity violation (right panel,
"Hard vs. Smooth Violation (T=1)")


4.3 Motif Diversity Violation 

As previously described in Equation 3, the motifs need to be distant
by a margin of $2T$. We call this property motif diversity. This
section is devoted to formalizing a differentiable penalty function
for violations of the diversity threshold $2T$ by the distances among
motifs. As a first step, the distance between two motifs
$M_{k,:} \in \mathbb{R}^{L}$ and $M_{p,:} \in \mathbb{R}^{L}$ is
defined as $\phi_{k,p} : (\mathbb{R}^{L} \times \mathbb{R}^{L}) \to \mathbb{R}$
and formalized in Equation 6.

$$ \phi_{k,p} = \sum_{l=1}^{L} (M_{k,l} - M_{p,l})^2 \tag{6} $$

The distance $\phi_{k,p}$ of any pair of motifs $M_{k,:}, M_{p,:}$
should obey the diversity constraint shown in Equation 7.

$$ \phi_{k,p} > 2T, \quad \forall k \in \{1, \dots, K\},\ \forall p \in \{k+1, \dots, K\} \tag{7} $$


We introduce the concept of diversity violation in Equations 8-9. For
each pair of motifs, the violation is 0 if the distance between the
two motifs is greater than $2T$. Otherwise, if the distance is zero,
then the motifs are identical (hence not at all diverse) and a
maximum violation of one is returned. For all distances between 0 and
$2T$, a linear violation between 0 and 1 is returned, as formalized
in Equation 9. The constant term $\frac{2}{K(K-1)}$ makes sure that
the violation function has a range between 0 and 1, the same range as
the approximative frequency.


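$$ \mathcal{V}(M) = \frac{2}{K(K-1)} \sum_{k=1}^{K} \sum_{p=k+1}^{K} \mathcal{V}_{k,p} \tag{8} $$

$$ \mathcal{V}_{k,p} = \begin{cases} 1 - \frac{\phi_{k,p}}{2T} & \phi_{k,p} < 2T \\ 0 & \phi_{k,p} \geq 2T \end{cases} \tag{9} $$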


Despite achieving its aim, the violation penalty of Equations 8-9 is
not differentiable at the point $\phi_{k,p} = 2T$. Therefore, we
propose a smooth and differentiable variant of the violation penalty
in Equations 10-11, obtained by squaring the hard violation of
Equation 9.

$$ \hat{\mathcal{V}}(M) = \frac{2}{K(K-1)} \sum_{k=1}^{K} \sum_{p=k+1}^{K} \hat{\mathcal{V}}_{k,p} \tag{10} $$

$$ \hat{\mathcal{V}}_{k,p} = \begin{cases} \left(1 - \frac{\phi_{k,p}}{2T}\right)^2 & \phi_{k,p} < 2T \\ 0 & \phi_{k,p} \geq 2T \end{cases} \tag{11} $$

As in the case of the frequency, we denote the smooth approximative
version of the violation penalty by a hat ($\mathcal{V}$ for hard and
$\hat{\mathcal{V}}$ for smooth). The violation penalty as a function
of the distance between motif pairs is depicted in the right plot of
Figure 2.
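The hard and smooth violation penalties of Equations 8-11 can be sketched as follows. The function names are hypothetical, and the code assumes at least two motifs so that the normalizing constant 2 / (K (K - 1)) is defined.

```python
def squared_distance(a, b):
    """Squared Euclidean distance (phi in Equation 6)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hard_violation_pair(phi, threshold):
    """Equation 9: linear penalty, 1 at phi = 0 and 0 from phi = 2T on."""
    return max(0.0, 1.0 - phi / (2.0 * threshold))


def smooth_violation_pair(phi, threshold):
    """Equation 11: the hard penalty squared, differentiable at phi = 2T."""
    return hard_violation_pair(phi, threshold) ** 2


def smooth_violation(motifs, threshold):
    """Equation 10: average smooth penalty over all motif pairs (K >= 2)."""
    k = len(motifs)
    total = 0.0
    for a in range(k):
        for b in range(a + 1, k):
            phi = squared_distance(motifs[a], motifs[b])
            total += smooth_violation_pair(phi, threshold)
    return 2.0 * total / (k * (k - 1))


# Identical motifs are maximally penalized; distant motifs are not penalized.
print(smooth_violation([[0.0, 0.0], [0.0, 0.0]], threshold=1.0))
print(smooth_violation([[0.0, 0.0], [3.0, 0.0]], threshold=1.0))
```

Squaring the linear penalty makes its slope vanish as the distance approaches 2T, which removes the kink of Equation 9.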

4.4 Motif Learning Through Optimization 

This section fuses the smooth motif frequency and the smooth motif
diversity violation into a meaningful objective function. Our aim is
to learn a set of $K$ motifs that maximize the frequencies and
minimize (ideally have no) violations. Such an objective can be
elegantly constructed as the maximization task of Equation 12.

$$ M^* = \operatorname*{argmax}_{M} O(M) = \operatorname*{argmax}_{M} \ \hat{\mathcal{F}}(M) - \hat{\mathcal{V}}(M) \tag{12} $$


The universally optimal motifs are those which achieve the universal
maximum value of our objective function
$O(M) = \hat{\mathcal{F}}(M) - \hat{\mathcal{V}}(M)$. As both terms
are positive, the objective is maximized for the highest motif
frequencies and zero violations. In this paper we optimize the
objective function through gradient ascent motif updates over a
series of iterations. Since the ranges of both $\hat{\mathcal{F}}$
and $\hat{\mathcal{V}}$ are between 0 and 1, no term over-scales the
other and the overall learning does converge. In our preliminary
experiments we found that a trade-off coefficient $\beta$ in the form
$\hat{\mathcal{F}}(M) - \beta \hat{\mathcal{V}}(M)$ was not needed,
as both terms converge quickly.

Figure 3: Top-3 motifs from the "Insect B" time series
($L = 150$, $T = 61$, $\eta = 0.3$, $I = 300$, $\alpha = 2$)

The output of the learning process is a set of motifs $M$, as shown
in Figure 3 for the "Insect B" time series. In this illustration the
top three motifs ($K = 3$) are shown in the right plots, while the
matches of the motifs on the time series are shown in the upper-left
plot. Z-normalized versions of the matched segments are shown in the
middle-left plot, and the lower-left plot illustrates the per-segment
smooth frequency scores of the motifs.


4.5 Learning Algorithm 

Having defined the partial derivatives needed for gradient ascent, we
can present the complete learning method. Our method is detailed in
Algorithm 2, and in this section we explain the steps of the
algorithm in detail. The learning process has a set of
hyper-parameters, starting with the frequency smoothness $\alpha$.
The other important hyper-parameters are the number of motifs $K$,
the threshold $T$ and the motif length $L$, to be set by a
practitioner. The learning rate $\eta$ and the number of iterations
$I$ are less critical hyper-parameters that control the number of
steps needed until convergence. For small learning rates and a large
number of iterations, convergence is safely achievable.



Figure 4: Metamorphosis of three motifs on the "EOG" time series
($L = 150$, $T = 58$, $\eta = 0.3$, $I = 300$, $\alpha = 2$)


4.4.1 Gradient Ascent Optimization 
Since the objective function of Equation 12 is a subtraction of
frequency and diversity violations, the partial gradient of the
objective function with respect to each point $l$ of any $k$-th motif
is decomposable as shown in Equation 13.

$$ \frac{\partial O(M)}{\partial M_{k,l}} = \frac{\partial \hat{\mathcal{F}}(M)}{\partial M_{k,l}} - \frac{\partial \hat{\mathcal{V}}(M)}{\partial M_{k,l}} \tag{13} $$

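The partial derivative of the smooth frequency with respect to the
motif is computed as the first derivative of Equation 4 in terms of
$M$ and is shown below in Equation 14.

$$ \frac{\partial \hat{\mathcal{F}}(M)}{\partial M_{k,l}} = \frac{-2\alpha}{KJT} \sum_{j=1}^{J} \hat{\mathcal{F}}_{k,j} \left( M_{k,l} - S_{j,l} \right) \tag{14} $$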


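Similarly, the partial derivative of the diversity violation with
respect to each motif's point is defined in Equation 15.

$$ \frac{\partial \hat{\mathcal{V}}(M)}{\partial M_{k,l}} = \frac{2}{K(K-1)} \sum_{q=1,\, q \neq k}^{K} \frac{\partial \hat{\mathcal{V}}_{k,q}}{\partial M_{k,l}}, \qquad \frac{\partial \hat{\mathcal{V}}_{k,q}}{\partial M_{k,l}} = \begin{cases} \frac{1}{T^2} \left( \phi_{k,q} - 2T \right) \left( M_{k,l} - M_{q,l} \right) & \phi_{k,q} < 2T \\ 0 & \phi_{k,q} \geq 2T \end{cases} \tag{15} $$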


The algorithm starts with a set of motifs initialized from random
segments and updates them in the direction of the partial gradients
using a learning-rate step size. The learning rate is dynamically
updated for each point of each motif using an adaptive technique
known as AdaGrad [9]. We accumulate the squares of the partial
gradients into accumulators denoted by $\nabla$. In order to speed up
the updates we pre-compute the per-segment frequencies
$\hat{\mathcal{F}}_{k,j}$ and the pair distances $\phi_{k,q}$ in
lines 9-12. Then every point of each motif $M_{k,l}$ is updated in
the positive direction of the derivative in lines 13-25. The partial
gradients correspond to the ones previously explained in Section
4.4.1. The update of line 24 adjusts the learning rate by the square
root of the accumulated squared gradients [9].

As a consequence of the gradient ascent updates, the motifs undergo a
metamorphosis, as shown in Figure 4 for the "Full EOG" time series.
The illustrative motifs are learned on the first 10000
non-overlapping segments of the time series having length $L = 150$.
At the beginning (Iteration 0) the motifs are random and the
corresponding frequencies are zero; however, the motifs start to take
form after approximately 20 iterations and converge after 40
iterations. The metamorphosis of the motifs proceeds such that their
matching frequencies (lower plots) are maximized.
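The update procedure described above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' code: the function name, the default values, the `seed` argument and the small `eps` added to the AdaGrad denominator (to avoid a division by zero when a gradient is exactly zero) are hypothetical additions. The gradients follow Equations 13-15, and the motifs are initialized from randomly chosen segments.

```python
import math
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def learn_motifs(segments, threshold, num_motifs, alpha=2.0, eta=0.3,
                 iterations=300, seed=0, eps=1e-12):
    """Gradient ascent with AdaGrad step sizes on F_hat(M) - V_hat(M)."""
    rng = random.Random(seed)
    J, L, K = len(segments), len(segments[0]), num_motifs
    # Initialize motifs from random segments and zero gradient accumulators.
    motifs = [list(segments[rng.randrange(J)]) for _ in range(K)]
    accum = [[0.0] * L for _ in range(K)]
    # Constant factors of Equations 14 and 15.
    c_f = -2.0 * alpha / (K * J * threshold)
    c_v = 2.0 / (K * (K - 1) * threshold ** 2) if K > 1 else 0.0

    for _ in range(iterations):
        # Precompute per-segment smooth scores and pairwise motif distances.
        score = [[math.exp(-alpha / threshold * squared_distance(m, s))
                  for s in segments] for m in motifs]
        phi = [[squared_distance(a, b) for b in motifs] for a in motifs]
        for k in range(K):
            for l in range(L):
                # Gradient of the smooth frequency (Equation 14).
                grad_f = c_f * sum(
                    score[k][j] * (motifs[k][l] - segments[j][l])
                    for j in range(J))
                # Gradient of the smooth violation (Equation 15).
                grad_v = c_v * sum(
                    (phi[k][q] - 2 * threshold) * (motifs[k][l] - motifs[q][l])
                    for q in range(K)
                    if q != k and phi[k][q] < 2 * threshold)
                grad = grad_f - grad_v
                # AdaGrad: scale the step by the accumulated squared gradients.
                accum[k][l] += grad * grad
                motifs[k][l] += eta / (math.sqrt(accum[k][l]) + eps) * grad
    return motifs


# Hypothetical 1-D toy case: two segments at 0 and 1 with T = 2.
# The learned motif moves to a point between them, which is not a segment.
print(learn_motifs([[0.0], [1.0]], threshold=2.0, num_motifs=1))
```

In the toy run the single motif settles midway between the two segments, a latent pattern that a search restricted to the segments could not return.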

4.6 Convergence of The Learning Algorithm 

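The learning algorithm converges by updating the motifs so that the
approximative frequency is maximized and the diversity violations are
minimized to zero, as shown in Figure 5 (left plot) for an execution
on the "Insect B" dataset. It is worth noting that the inclusion of
the penalty on the diversity violation is crucial for preserving the
diversity constraint. An experiment is shown in the right plot of
Figure 5. In this experiment line 24 of Algorithm 2 is edited so that
the motifs are updated only with respect to the frequency and not the
diversity violation (see plot title). As we can clearly see,
maximizing the frequencies without penalizing diversity violations
causes the motifs to become similar to each other. That is
demonstrated by the fact that the violation measure increases, as
shown in the right plot of Figure 5.

Algorithm 2 LearnMotifs()

 1: Input: Threshold $T \in \mathbb{R}^{+}$, Motif length $L \in \mathbb{N}^{+}$, Number of Motifs $K \in \mathbb{N}^{+}$, Segments $S \in \mathbb{R}^{J \times L}$, Learning Rate $\eta \in \mathbb{R}^{+}$, Number of iterations $I \in \mathbb{N}^{+}$, Smoothness $\alpha \in \mathbb{R}^{+}$
 2: Output: Motifs $M \in \mathbb{R}^{K \times L}$
 3: // Initialize random motifs and gradient accumulators:
 4: $M \leftarrow$ $K$ randomly chosen segments of $S$, $\nabla \leftarrow 0^{K \times L}$
 5: // Initialize constant values:
 6: $C_V \leftarrow \frac{2}{K(K-1)T^2}$, $C_F \leftarrow \frac{-2\alpha}{KJT}$
 7: // Iterate the learning method:
 8: for iter $= 1, \dots, I$ do
 9:     // Precompute the per-segment occurrence scores:
10:     $\hat{\mathcal{F}}_{k,j} \leftarrow e^{-\frac{\alpha}{T} \sum_{l=1}^{L} (M_{k,l} - S_{j,l})^2}, \ \forall k \in \{1, \dots, K\}, \forall j \in \{1, \dots, J\}$
11:     // Precompute the pair-wise motif distances:
12:     $\phi_{k,q} \leftarrow \sum_{l=1}^{L} (M_{k,l} - M_{q,l})^2, \ \forall k \in \{1, \dots, K\}, \forall q \in \{1, \dots, K\}$
13:     // Update the motifs:
14:     for $k = 1, \dots, K$; $l = 1, \dots, L$ do
15:         // Gradient of frequency w.r.t. the motif:
16:         $\frac{\partial \hat{\mathcal{F}}(M)}{\partial M_{k,l}} \leftarrow C_F \sum_{j=1}^{J} \hat{\mathcal{F}}_{k,j} (M_{k,l} - S_{j,l})$
17:         // Gradient of diversity violation w.r.t. the motif:
18:         $\frac{\partial \hat{\mathcal{V}}(M)}{\partial M_{k,l}} \leftarrow C_V \sum_{q=1,\, q \neq k}^{K} \begin{cases} (\phi_{k,q} - 2T)(M_{k,l} - M_{q,l}) & \phi_{k,q} < 2T \\ 0 & \phi_{k,q} \geq 2T \end{cases}$
19:         // Gradient of the final objective w.r.t. the motif:
20:         $\frac{\partial O(M)}{\partial M_{k,l}} \leftarrow \frac{\partial \hat{\mathcal{F}}(M)}{\partial M_{k,l}} - \frac{\partial \hat{\mathcal{V}}(M)}{\partial M_{k,l}}$
21:         // Update the history of gradients:
22:         $\nabla_{k,l} \leftarrow \nabla_{k,l} + \left( \frac{\partial O(M)}{\partial M_{k,l}} \right)^2$
23:         // Update the motif point:
24:         $M_{k,l} \leftarrow M_{k,l} + \frac{\eta}{\sqrt{\nabla_{k,l}}} \frac{\partial O(M)}{\partial M_{k,l}}$
25:     end for
26: end for
27: return $M$

Figure 5: Convergence on "Insect B" dataset ($K = 5$, $T = 382$,
$\eta = 0.3$). Plot titles: left,
$\frac{\partial O(M)}{\partial M_{k,l}} = \frac{\partial \hat{\mathcal{F}}(M)}{\partial M_{k,l}} - \frac{\partial \hat{\mathcal{V}}(M)}{\partial M_{k,l}}$;
right,
$\frac{\partial O(M)}{\partial M_{k,l}} = \frac{\partial \hat{\mathcal{F}}(M)}{\partial M_{k,l}}$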

5. OPTIMALITY OF OUR METHOD 

The objective function of Equation 12 is not concave, because the
frequency function is a sum of Gaussians and hence not concave. We
demonstrate the non-concavity of the frequency function in Figure 6
using the TAO and EEG LSF5 datasets. Here we generate all possible
motifs of length 500 using two values (for the sake of a 3D plot):
one value for all of the first 250 points, on the X-axis, and another
value for the last 250 points, on the Y-axis. As can be clearly seen,
the frequency is not a concave function of the motifs and has
multiple local maxima.


Figure 6: Non-concave frequency $\hat{\mathcal{F}}(M)$ as a function
of motif values $M_{1,:}$ on the TAO and EEG LSF5 time-series
datasets (one panel per dataset). Parameters: $L = 500$, $T = 100$,
$\alpha = 2$


In the case of non-concave functions (or non-convex ones for
minimization problems), an effective cook-book solution is to combine
gradient-based optimization with a random-restart strategy [16]. In
order to avoid getting stuck in local maxima, the gradient ascent
optimization is restarted multiple times with random initial values
for the motifs. The run that achieves the highest $\mathcal{F}(M)$ is
selected, as formalized in Equation 16, where the number of restarts
is denoted by $R \in \mathbb{N}$. It is important to recognize that
we select the motifs yielding the highest hard frequency
$\mathcal{F}$, not the proxy smooth one $\hat{\mathcal{F}}$. The hard
frequency $\mathcal{F}$ avoids counting trivial matches in our
implementation.

$$ M^* := \operatorname*{argmax}_{M^{(r)},\ r = 1, \dots, R} \mathcal{F}\left(M^{(r)}\right) \quad \text{s.t. } M^{(r)} \leftarrow \text{LearnMotifs() from Algorithm 2} \tag{16} $$
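The restart strategy of Equation 16 amounts to a simple selection loop, sketched below. The helper name `best_of_restarts` and its callable arguments are hypothetical: `learn` stands in for one run of the learning algorithm from a fresh random initialization, and `score` stands in for the hard frequency used to rank the runs.

```python
import random


def best_of_restarts(learn, score, restarts, seed=0):
    """Run `learn(rng)` `restarts` times and keep the highest-scoring result.

    `learn` receives a random generator so each run can draw its own
    initial motifs; `score` maps a set of motifs to a number (for example
    the hard frequency of Equation 1).
    """
    rng = random.Random(seed)
    best_motifs, best_score = None, float("-inf")
    for _ in range(restarts):
        motifs = learn(rng)
        value = score(motifs)
        if value > best_score:
            best_motifs, best_score = motifs, value
    return best_motifs, best_score


# Stand-in learner for illustration only: it returns a random 1-D "motif".
# The stand-in score counts toy segments within squared distance 1 of it.
toy_segments = [0.0, 0.2, 0.4, 3.0]
motifs, value = best_of_restarts(
    learn=lambda rng: [[rng.uniform(-1.0, 4.0)]],
    score=lambda m: sum(1 for s in toy_segments if (m[0][0] - s) ** 2 < 1.0),
    restarts=50,
)
print(motifs, value)
```

Ranking the runs by the hard frequency rather than the smooth surrogate mirrors the selection rule stated above.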


Figure 7 illustrates the effect of 50 random restarts on the values
of the frequency function $\mathcal{F}(M)$ over the TAO dataset. In
the left plot we see that the maximum values of the objective are
reached after a few restarts. The distribution of the frequency
values, shown in the right plot, is approximately normal. That means
each restart has a reasonable chance of yielding a value in the right
(maximal) portion of the distribution.


5.1 Runtime Algorithmic Complexity 

The runtime complexity of Algorithm 2 is determined by the
pre-computation steps and the update steps. Computing the frequency
terms has an algorithmic complexity of $O(RIKJL)$, while computing
the pairwise distances has a complexity of $O(RIK^2L)$. Computing the
partial gradients of the frequency with respect to the motifs has a
complexity of $O(RIKJL)$. Similarly, computing the gradients of the
diversity violation with respect to the motifs has a complexity of
$O(RIK^2L)$. The overall complexity of the algorithm is
$O(RIKJL + RIK^2L + RIKJL + RIK^2L)$, which translates to
$O(2RIK(J + K)L) \approx O(RIKJL)$ since $K \ll J$. The brute-force
search, on the other hand, has a complexity of $O(J^2L)$, which is
quadratic in the number of segments $J$. In contrast, our method is
linear in the number of segments $J$ and faster than the brute-force
search when $RIK < J$. It is worth recalling that our algorithm
learns optimal motifs (brute-force finds non-optimal motifs) and that
its primary strength is quality at a feasible runtime.

Figure 7: Impact of random restarts on $\mathcal{F}(M)$; "TAO"
time-series dataset with hyper-parameters $L = 500$, $K = 10$,
$\eta = 0.3$, $I = 300$, $T = 109.6$
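As a rough worked illustration of the condition $RIK < J$, the dominant terms of both complexities can be compared numerically. The helper names are hypothetical, constant factors are ignored, and the segment counts used below are illustrative assumptions; R = 50, I = 300, K = 10 and L = 500 follow the settings reported for Figure 7.

```python
def brute_force_cost(num_segments, length):
    """Dominant term of the brute-force search: J^2 * L."""
    return num_segments ** 2 * length


def learning_cost(restarts, iterations, num_motifs, num_segments, length):
    """Dominant term of the learning method: R * I * K * J * L."""
    return restarts * iterations * num_motifs * num_segments * length


R, I, K, L = 50, 300, 10, 500
for J in (10 ** 5, 10 ** 6):
    ratio = brute_force_cost(J, L) / learning_cost(R, I, K, J, L)
    # The ratio equals J / (R * I * K); values above 1 favor learning.
    print(J, ratio)
```

With these settings the break-even point is $J = RIK = 150000$ segments; beyond it the learning method has the smaller dominant term.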

6. EMPIRICAL RESULTS 

6.1 Experimental Setup 

We compare the quality of the proposed method against the brute-force
search strategy using a battery of six time-series datasets from
diverse application domains. In addition, we employ an evaluation
protocol which compares the frequencies of the computed motifs for
different numbers of motifs, motif lengths and distance thresholds.

6.1.1 Datasets 

• Insect B is a time series of insect behavior data and has a length
of 73929 points [23].

• TAO is a long time series representing Tropical Atmosphere Ocean
temperature measurements, having 741528 measurements.¹

• RandomWalk is a time-series dataset consisting of 1000000 points,
among which motifs are implanted at randomly selected time-stamps
[23].

• EEG is a series of 1802136 continuous measurements from
electroencephalographic sensors, measuring voltage differences across
the scalp [23].

• Salinity is a time series containing recordings of the level of
oceanic salt concentration. The data has a length of 2324134 points
and is provided by the National Oceanographic Data Center.²

• EOG is the longest series in our collection, consisting of 8099500
points. The data is collected by an Electro-Oculogram and represents
the electrical potential between the front and the back of a human
eye [11].

¹ www.pmel.noaa.gov/tao

² http://www.nodc.noaa.gov/General/salinity.html


6.1.2 Baseline 

Many motif discovery methods are based on searching for
frequent patterns among the series segments (e.g. [25, 28, 13],
enumerated in a broader scope in Section 2).
While those search-based methods are successful in terms of
scalability, data representation, on-line learning, etc., they
are still upper bounded in quality (i.e. frequency) by the
Brute-Force search. That is trivial to show, because all the
frequent sub-sequences those methods could find are also
detectable by Brute-Force search. In that respect, it is sufficient
to demonstrate that our method is superior to Brute-Force
searching in terms of quality (i.e. frequency), and that
naturally translates into qualitative superiority over all
the other scalable/approximate/on-line search-based
methods.
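
To make the baseline concrete, a minimal Python sketch of a brute-force search is given below. It is not the authors' implementation: the function name is hypothetical, the greedy selection order is an assumption, the rule that motifs must lie at least 2T apart is taken from the diversity remark in the case study, and the handling of trivial matches is omitted. Every candidate is an existing segment, which is exactly the restriction that bounds the quality of all search-based methods.

```python
import numpy as np

def brute_force_motifs(segments, T, K):
    """Greedily pick up to K segments with the most neighbours within T."""
    # All pairwise Euclidean distances: O(J^2 L) time, toy-sized memory use.
    diff = segments[:, None, :] - segments[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    freq = (dist <= T).sum(axis=1) - 1           # do not count the self-match
    chosen = []
    for idx in np.argsort(-freq, kind="stable"):  # most frequent candidates first
        # Assumed diversity rule: selected motifs are at least 2T apart.
        if all(dist[idx, c] >= 2 * T for c in chosen):
            chosen.append(int(idx))
        if len(chosen) == K:
            break
    return chosen, [int(freq[c]) for c in chosen]

# Toy data: two tight clusters and one outlier in a 2-dimensional space.
segments = np.array([[0, 0], [0.1, 0], [0, 0.1], [-0.1, 0], [0, -0.1],
                     [10, 10], [10.1, 10], [10, 10.1], [50, 50]], dtype=float)
motifs, freqs = brute_force_motifs(segments, T=1.0, K=2)
print(motifs, freqs)
```

On the toy data the search returns one representative per cluster, with frequencies equal to the number of other cluster members.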


6.1.3 Evaluation Protocol 

We compare against the brute-force search algorithm
as the highest-quality search-based baseline. Our protocol
involves comparisons across all the parameters of both the
search-based and learning-based methods.

Three different numbers of motifs are computed, K ∈
{3, 10, 30}, with two different lengths, L ∈ {500, 1000}.
Furthermore, the threshold (T) of the experiments is chosen
as a percentile in the distribution of distances between
segments. To illustrate the setup, a threshold corresponding to
the 1st percentile (denoted Pct = 1 in Table 1) means
that 1% of segment pairs have a pairwise Euclidean distance
smaller than the threshold. In that way we can compare
our method against the brute-force search across a
range of thresholds T computed from different percentiles,
Pct ∈ {0.001%, 0.01%, 0.1%, 1%}, of the pairwise distances of
segments. This avoids hand-picking different threshold
values per dataset and selects the threshold in a neutral,
data-driven manner. In order to ensure convergence, the
learning rate was set to an initial value of η = 0.1 and the
number of iterations to I = 1000. In addition, the
optimization was restarted R = 200 times. The segments were
extracted from the series by sliding a window and
normalizing the clipped segment, with the window slid by half
of the motif length. For every combination of the number
of motifs K, length L and threshold T (computed from the
percentile), three different values of the frequency smoothness
were searched, α ∈ {1, 2, 3}, keeping the one yielding the
highest frequency F.
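
The segment extraction and the data-driven choice of the threshold can be sketched as follows. This is a minimal illustration under stated assumptions, not the published code: the function names are hypothetical, "normalizing" is assumed to mean z-normalization of each window, and the full pairwise distance matrix is only practical for small inputs such as the synthetic random walk used here.

```python
import numpy as np

def sliding_segments(series, length):
    """Windows of the given length, slid by half the length, z-normalized."""
    stride = length // 2
    starts = range(0, len(series) - length + 1, stride)
    segs = np.stack([series[s:s + length] for s in starts]).astype(float)
    mean = segs.mean(axis=1, keepdims=True)
    std = segs.std(axis=1, keepdims=True)
    std[std == 0] = 1.0                      # guard against constant windows
    return (segs - mean) / std

def percentile_threshold(segments, pct):
    """Threshold T such that roughly pct percent of segment pairs are closer."""
    sq = (segments ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * segments @ segments.T
    upper = np.triu_indices(len(segments), k=1)      # each pair counted once
    dists = np.sqrt(np.maximum(d2[upper], 0.0))      # clip rounding noise
    return np.percentile(dists, pct), dists

rng = np.random.default_rng(0)
series = np.cumsum(rng.standard_normal(5000))        # synthetic random walk
segments = sliding_segments(series, length=100)
T, dists = percentile_threshold(segments, pct=1.0)   # Pct = 1
frac = float(np.mean(dists <= T))
print(segments.shape, round(T, 3), round(frac, 4))
```

With Pct = 1 about one percent of the segment pairs fall within the returned threshold, which is the property the protocol relies on to keep thresholds comparable across datasets.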

The brute-force search baseline was executed using the
same K, L, T (Pct) parameter combinations as the
learning-based approach, and for both methods the final frequency
F does not include trivial matches. In order to be entirely
transparent to the research community, we have publicly shared
our source code and the data used in this paper in an on-line
repository.³

6.2 Results 

In the conducted experiments, for all the different thresholds
T (computed through the percentile), for all the different
numbers of motifs K and for different motif lengths L,
the motifs learned through our method almost always had
a higher frequency than the ones found through brute-force
search. Table 1 displays the empirical results comparing the
frequency score of the optimal learned motifs (denoted LM)
against the motifs found through brute-force search (denoted
BFM).

³ http://fs.ismll.de/publicspace/LearnMotifs/

Table 1: Hard Frequencies (F(M)), Learning Motifs (LM) vs. Brute Force Motifs (BFM)

            | L = 500                                               | L = 1000
            | Pct=0.001   | Pct=0.01    | Pct=0.1     | Pct=1       | Pct=0.001   | Pct=0.01    | Pct=0.1     | Pct=1
Dataset     |   BFM    LM |   BFM    LM |   BFM    LM |   BFM    LM |   BFM    LM |   BFM    LM |   BFM    LM |   BFM    LM

Top-3 (K=3)
Insect B    |     4     9 |     6    10 |    16    45 |    44   151 |     4    11 |     4    13 |     9    27 |    19    51
TAO         |    12    24 |    29    45 |    86   119 |   313   429 |    10    12 |    18    35 |    56    98 |   219   284
RandomWalk  |    25    43 |    74   125 |   239   321 |   697   855 |     9    23 |    27    64 |   114   165 |   327   458
EEG LSF5    |    17    42 |    47   101 |   150   199 |   388   442 |    11    34 |    27    73 |    96   125 |   232   238
Salinity    |    39    48 |   151   184 |   497   590 |  1462  1718 |    18    32 |    72    94 |   269   330 |   683   876
EOG         |   153   190 |   504   669 |  1646  2168 |  4957  8042 |    67   102 |   196   340 |   676  1390 |  2171  5998

Top-10 (K=10)
Insect B    |    11    18 |    14    23 |    35    78 |    81   189 |    11    29 |    11    28 |    17    54 |    46    97
TAO         |    30    48 |    62    95 |   192   314 |   780  1164 |    18    29 |    44    55 |   112   203 |   344   584
RandomWalk  |    40    79 |   132   206 |   313   579 |  1314  1502 |    23    48 |    52   118 |   223   310 |   603   768
EEG LSF5    |    42   109 |   131   273 |   400   557 |  1118  1266 |    32    96 |    84   212 |   234   379 |   634   810
Salinity    |   100   105 |   291   358 |  1000  1149 |  2797  2995 |    47    59 |   136   198 |   456   597 |  1222  1564
EOG         |   263   283 |   973  1296 |  3128  4130 | 11181 13439 |   122   164 |   417   685 |  1552  2206 |  4321  5729

Top-30 (K=30)
Insect B    |    31    40 |    36    47 |    68   107 |   200   221 |    32    49 |    32    49 |    42    72 |    89   110
TAO         |    65    95 |   133   209 |   432   698 |  1720  2193 |    38    55 |    65    93 |   202   336 |   577   932
RandomWalk  |    61   117 |   158   279 |   471   764 |  1778  2249 |    45    87 |    83   174 |   256   421 |   989  1151
EEG LSF5    |   110   281 |   275   646 |   850  1442 |  2541  3505 |    72   205 |   153   428 |   417   879 |  1304  1914
Salinity    |   162   199 |   428   540 |  1260  1456 |  3270  3855 |    91   107 |   233   284 |   660   779 |  2038  2150
EOG         |   427   557 |  1494  2028 |  5200  5681 | 17442 17075 |   247   338 |   787  1186 |  2306  2955 |  6227  7349

Wins        |     0    18 |     0    18 |     0    18 |     1    17 |     0    18 |     0    18 |     0    18 |     0    18

The results of Table 1 indicate that learning the motifs
(LM) is better than searching for them (BFM) in 99.31% of
the experiments (143/144). On average, the motif frequencies
obtained by learning (LM) are 67 ± 56% higher than those
of the search-based approach (BFM).
The well-known Bland-Altman plot is used to assess the
significance of the improvements. Figure 8 (left plot) shows the
dominating ratio of LM through least-squares fitting.
Moreover, the right plot shows that the difference LM - BFM and
its standard deviations are above zero, thus we have a
significant difference in terms of frequencies.
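
The quantities behind such a comparison can be reproduced in a few lines of Python. The sketch below uses only the Top-3, L = 500 block of Table 1 (24 of the 144 experiments), so its summary values are not expected to equal the figures quoted for the full table. The use of the natural logarithm and the 1.96 standard-deviation limits are conventional Bland-Altman choices assumed here, not details confirmed by the paper.

```python
import numpy as np

# Top-3 (K=3), L=500 frequencies from Table 1: six datasets x four percentiles.
bfm = np.array([4, 6, 16, 44, 12, 29, 86, 313, 25, 74, 239, 697,
                17, 47, 150, 388, 39, 151, 497, 1462, 153, 504, 1646, 4957],
               dtype=float)
lm = np.array([9, 10, 45, 151, 24, 45, 119, 429, 43, 125, 321, 855,
               42, 101, 199, 442, 48, 184, 590, 1718, 190, 669, 2168, 8042],
              dtype=float)

wins = int((lm > bfm).sum())
rel_gain = (lm - bfm) / bfm                  # relative improvement per experiment

# Bland-Altman quantities on the log scale: pair mean versus pair difference.
log_mean = (np.log(lm) + np.log(bfm)) / 2.0
log_diff = np.log(lm) - np.log(bfm)
bias = log_diff.mean()
spread = log_diff.std(ddof=1)

print(f"wins: {wins}/{len(lm)}")
print(f"mean relative gain on this subset: {100 * rel_gain.mean():.1f}%")
print(f"log-difference bias: {bias:.3f}, "
      f"limits: {bias - 1.96 * spread:.3f} .. {bias + 1.96 * spread:.3f}")
```

Plotting log_diff against log_mean would give the right-hand panel of a Bland-Altman figure for this subset; a positive bias indicates that LM frequencies exceed BFM frequencies on average.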




Figure 8: Bland-Altman plot showing significance
of LM vs BFM frequencies (log-scale for visual
comprehension). Horizontal axis: Mean BFM & LM (log(F)).

Even though the proposed method is significantly better
in quality than the search-based alternatives, it is not the
fastest method in the literature (nor do we claim it to be).
Yet, it is feasible in terms of run-time: learning the
Top-30 motifs of Insect B (smallest dataset) took 4.7 minutes,
while learning the Top-30 motifs of EOG (largest dataset)
took 33.57 hours, on a cluster with Intel Xeon E5-2670v2
processors running at 2.50 GHz.

7. CASE STUDY: AUDIO MOTIFS 

In this case study we extract motifs from audio files. The
case discussed in this section is a poem by Edgar Allan Poe,
titled "The Bells" and famous for its onomatopoeic nature in
terms of repeating the word "Bells". We extract a time-series
representation of the audio file through the first channel of
the Mel-frequency cepstral coefficients (MFCC). For the sake
of illustration we took the first 300000 measurements of the
original WAV file, corresponding to a 68-second audio
reading of the poem.

Figure 9 shows the MFCC time-series representation
together with the results of the brute-force search algorithm
in blue and our proposed method in red. We extracted
three motifs (K = 3) of length L = 300 for both methods.
The distance threshold used in the experiment is the 0.1th
percentile of pair-wise segment distances, corresponding
to a value of T = 171.56. For each method, we display the
location of the motif matches over series segments with a
filled oval mark. Under the plots of the matches we show
the found motifs together with the corresponding frequencies.
For the same distance threshold, the learned motifs
have 50 matches in total while the searched motifs have 35
matches, an improvement of about 43% in terms of frequency.
Our method learns patterns that, for exactly the same
distance threshold, match more frequently than the brute-force
motifs.

Figure 9: Learning 3 audio motifs on a read version
of the poem "The Bells" by Edgar Allan Poe. The
method parameters are: T = 171.56 (Pct = 0.1%), L =
300, α = 3, η = 0.3, I = 1000, R = 4.

An investigation of the motif sounds reveals that the top-K
repetitive sounds are different pronunciations of the word
bell. All the motifs are at least 2T apart from each other,
so they are all legitimate motifs by definition. Let us analyze
how optimality translates into concrete terms. For instance,
consider the segment between points 10000-15000 in
the time series, which corresponds to the following poem
text:

...Of the rapture that impels
To the swinging and the ringing
Of the bells, bells, bells -
Of the bells, bells, bells, bells,
Bells, bells, bells -
To the rhyming and the chiming of the bells! ...

Within the above segment, the three brute-force motifs find
7, 10 and 7 occurrences of the word bell within a threshold T.
Our motifs find 9, 11 and 9 matches within the same interval
and for exactly the same distance threshold T. As the
ground-truth text above indicates, there are 11 pronunciations
of "bells" in total. On average, given the specified threshold
T, the brute-force motifs match the word Bells in about 73%
of the cases (24 of 33), while our optimal motifs match it in
about 88% of the cases (29 of 33). This is a notable detection
accuracy given that we used only the first channel of the MFCC
representation, which is a low-resolution representation that
encapsulates only the overall loudness of the sound.
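
The percentages quoted above are consistent with averaging, over the three motifs, the fraction of the 11 spoken "bells" that each motif matches. That reading is an assumption, since the averaging procedure is not spelled out; the short Python check below carries out the arithmetic under it.

```python
bells_in_text = 11            # occurrences of "bells" in the quoted excerpt
bfm_matches = [7, 10, 7]      # matches of the three brute-force motifs
lm_matches = [9, 11, 9]       # matches of the three learned motifs

def mean_detection_rate(matches, total):
    """Average, over motifs, of the fraction of occurrences matched."""
    return sum(m / total for m in matches) / len(matches)

bfm_rate = mean_detection_rate(bfm_matches, bells_in_text)
lm_rate = mean_detection_rate(lm_matches, bells_in_text)
print(f"BFM: {100 * bfm_rate:.1f}%  LM: {100 * lm_rate:.1f}%")
```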

8. CONCLUSION 

This paper proposed a new perspective on learning
time-series motifs. In contrast to current state-of-the-art
techniques, which search for motif candidates among series
segments, our method learns them through principled
optimization. The motif frequency is approximated as a
differentiable function and a gradient ascent method is proposed to
find the motif values which maximize the objective function.
In order to avoid local optima, a random restart strategy is
combined with the gradient ascent learning of the motifs.

Learned optimal motifs have more segment matches than
the motifs found through searching, for the same distance
threshold. The optimal motifs represent latent patterns
not necessarily present as sub-sequences in an explicit form;
our method can therefore identify motifs which lie at the
center of the densest hyper-balls containing segment points.
Detailed experimental results demonstrate that learning optimal
motifs almost always (in 143 of 144 experiments) produces
higher-quality motifs than searching for them.
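
A toy example makes the hyper-ball argument tangible. In the Python sketch below, four two-dimensional points stand in for segments, and their mean stands in for a learned motif; using the mean instead of gradient ascent, and all names in the code, are simplifications assumed for illustration only.

```python
import numpy as np

def frequency(candidate, segments, T):
    """Number of segments within Euclidean distance T of the candidate."""
    return int((np.linalg.norm(segments - candidate, axis=1) <= T).sum())

# Four toy "segments" around the origin; no segment is within T of another.
segments = np.array([[0.8, 0.0], [-0.8, 0.0], [0.0, 0.8], [0.0, -0.8]])
T = 1.0

# Searching: every candidate must itself be one of the segments.
best_searched = max(frequency(s, segments, T) for s in segments)

# Learning-like: the candidate may be any point, here the center of the ball.
center = segments.mean(axis=0)
learned_like = frequency(center, segments, T)

print(best_searched, learned_like)
```

Each segment matches only itself, whereas the center, which is not a segment, lies within T of all four points. This is the sense in which a motif can be more frequent than any explicit sub-sequence.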